Red wine is a popular alcoholic beverage made from grapes, and there are many varieties from all over the world. I have always had an interest in learning more about what contributes to the quality of wine. Environmental factors such as climate, soil, and sunlight, along with the species or variety of the grapes, all affect the grapes, and enological practices can also contribute to the taste and quality of the wine. The dataset contains chemical properties of wines and quality ratings from at least three experts.
The dataset includes red wine from the vinho verde region in Portugal. Only a small variety of grapes from this region is used to make red wine. Since the grape variety and the locale are fairly distinct, the data focuses on the chemistry of the wine, including additives used as part of the fermenting process. Do these additives and the chemistry involved in wine making have any effect on the quality of the wine? Will a good quality grape always make a great wine? Are there certain chemical properties that make a wine great?
I believe these are the questions the dataset was meant to investigate.
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
The dataset contains 1599 observations and 13 variables. At first glance the data consists mostly of numerical measurements for each wine plus an overall quality score.
The metric for each of the variables (based on physicochemical tests):
1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
The first variable I decided to look at was quality. Quality is a score between 0 (very bad) and 10 (excellent) based on a sensory test by at least 3 wine experts. Each expert graded the wine, and the median of the evaluations was used in the dataset.
To clean up the dataset I removed the X variable, which is just a row index.
The quality distribution looks roughly normal, with the majority of the wines in the center near the mean and median, a few on the bad end, and a few on the excellent end. One interesting thing to note is that no wine was rated very bad (0) or excellent (10); the scores only range from 3 to 8. This may be because the median score was taken for each sample.
Since I will be using quality frequently in graphs I decided to change the variable to a factor.
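As a rough sketch of the two clean-up steps above (dropping X and converting quality to a factor), the following uses a toy data frame built from the first ten rows shown in the str() output rather than the full dataset. The name quality_original appears in later output; how the factor was actually built is an assumption.

```r
# Toy excerpt: first ten rows of three columns, copied from the str() output.
red_wine <- data.frame(
  X       = 1:10,
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Drop the row-index column.
red_wine$X <- NULL

# Keep a numeric copy for correlations and models, then convert
# quality to a factor for use in grouped plots (assumed approach).
red_wine$quality_original <- red_wine$quality
red_wine$quality <- factor(red_wine$quality)

str(red_wine)
```

Keeping the numeric copy matters because functions such as cor.test() and lm() need a numeric response, while box plots by score are easier with a factor.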
I would like to look at the variables in groups. I grouped the variables based on what I found to be similarities in their profiles.
First is alcohol, density, and residual sugar. These three items are inevitable products of the fermentation process which is why I grouped them together.
The distribution of alcohol, the percentage of alcohol in a wine by volume, looks right-skewed. The density of the wines is close to that of pure water (1 gram per cubic centimeter). Residual sugar, defined as the amount of sugar remaining after fermentation stops, also has a skewed distribution. It is interesting that density, alcohol, and residual sugar do not have similar distributions. There is a slight similarity between residual sugar and alcohol: both are skewed.
A wine’s alcohol content (%) can be ranked as low (below 10%), medium-low (10-11.5%), medium (11.5-13.5%), medium-high (13.5-15%), and high (above 15%). Looking further into the statistics on alcohol in the dataset, the maximum is 14.9% and the minimum is 8.4%. I will transform the scale of the alcohol content (%) using the logarithm to see if there is anything else we can understand about it.
I didn’t find that the logarithm changed the distribution very much. The one interesting thing to note is that the count drops off quickly at 1 on the log scale.
It’s interesting that our dataset covers every level of alcohol except high. I would assume that wine from the same grape varieties would have a much narrower distribution of alcohol content (%). I want to look at the distribution of alcohol more closely with a box plot to see if there is any more information.
Almost half of the sample falls in the low alcohol content (%) category. I noticed that the percentage of alcohol by volume varies by up to 6.5 percentage points (from 8.4% to 14.9%). This would be a point I would like to learn about further, and I think more information about the wine and grape variety would be helpful.
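The alcohol categories above can be sketched with cut(). This is only an illustration: the values are the first ten alcohol readings from the str() output plus the dataset minimum and maximum, and treating each boundary as belonging to the higher band is my assumption.

```r
# First ten alcohol values from the str() output, plus the min and max.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5, 8.4, 14.9)

# Bands: low (<10), medium-low (10-11.5), medium (11.5-13.5),
# medium-high (13.5-15), high (15 and above).
# right = FALSE makes each interval closed on the left, e.g. [10, 11.5).
bands <- cut(
  alcohol,
  breaks = c(-Inf, 10, 11.5, 13.5, 15, Inf),
  labels = c("low", "medium-low", "medium", "medium-high", "high"),
  right  = FALSE
)
table(bands)

# On a base-10 log scale, 10% alcohol sits exactly at 1.
# Whether the original plot used base 10 is an assumption.
log10(10)
```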
Since residual sugar is so skewed I would like to look further at its distribution. According to Wine Folly, wine with 0-9 grams of residual sugar is considered dry and wine with 9-18 grams is considered off-dry. Every wine in the dataset is either dry or off-dry, as the maximum residual sugar is 15.5 g / dm^3, and there are wines in both categories.
The outliers in residual sugar are interesting. There seem to be many, but they all still count as dry or off-dry wines, so the variation is not as drastic as it seems. I would like to compare the distribution further. One question I have is why residual sugar varies so much. More information about residual sugar would be needed to do further analysis at a later time.
First I decided to look at alcohol and density a little closer.
There is an obvious trend: a wine’s density decreases as the alcohol content (%) increases. Alcohol is less dense than water, so it makes sense that the density would decrease as the alcohol content (%) increases. I noticed that there is a lot of dispersion. I would like to look at the graph more closely. Specifically, let’s look at wines with an alcohol content below 10%.
The density of pure water is 1 g / cm^3 and the density of drinking alcohol is about 0.789 g / cm^3. The density of wines with an alcohol content (%) below 10% is mainly between 0.996 and 0.999, with a few outliers on both sides. Now I would like to add residual sugar to the original graph to see if it has any effect on density and tells me a little more about the density and alcohol of the wine.
To make the graph look clearer I decided to exclude some of the farther outliers. There is a slight lightening above the trend line, but I am not sure it’s significant. A higher residual sugar should create denser wines, and the graph somewhat reflects that: the wines above the trend line appear to be a lighter blue than those beneath the trend line.
Next I would like to look at the distribution of alcohol by quality.
It looks like the distribution of alcohol is pretty wide for all of the quality scores. Even so, alcohol content (%) clearly increases with the score. The mean and median are represented in the graph by the red star and the line respectively. There is an obvious increase in alcohol content (%) in the higher quality wines, while the lower quality wines stay around 10%. Now I would like to look at the mean and median for each quality score to see if this trend is reflected in the descriptive statistics.
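To put the two reference densities above in context, here is a back-of-the-envelope estimate of the density of a plain water and alcohol mixture. It is a naive volume-weighted average, not a model of wine: it ignores the volume contraction that happens when alcohol and water mix and everything else dissolved in the wine. The function name is made up for this sketch.

```r
rho_water   <- 1.0    # g / cm^3
rho_ethanol <- 0.789  # g / cm^3, as quoted above

# Naive volume-weighted density for a given alcohol content (% by volume).
mix_density <- function(abv_percent) {
  f <- abv_percent / 100
  (1 - f) * rho_water + f * rho_ethanol
}

# Minimum, 10%, and maximum alcohol content in the dataset.
mix_density(c(8.4, 10, 14.9))
```

The estimates come out below the lowest density in the dataset (0.9901), which is consistent with sugar, acids, and other dissolved solids pushing the density of real wine back up toward that of water.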
## # A tibble: 6 x 4
## `red_wine$quality` alcohol_mean alcohol_median n
## <fctr> <dbl> <dbl> <int>
## 1 3 9.955000 9.925 10
## 2 4 10.265094 10.000 53
## 3 5 9.899706 9.700 681
## 4 6 10.629519 10.500 638
## 5 7 11.465913 11.500 199
## 6 8 12.094444 12.150 18
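As a quick consistency check on the table above, the per-score means and counts can be recombined to recover the overall figures. The numbers are copied from the table; nothing here re-reads the original data.

```r
# Per-score figures copied from the summary table above.
by_quality <- data.frame(
  quality      = c(3, 4, 5, 6, 7, 8),
  alcohol_mean = c(9.955000, 10.265094, 9.899706,
                   10.629519, 11.465913, 12.094444),
  n            = c(10, 53, 681, 638, 199, 18)
)

# The group sizes should add up to the 1599 observations.
total_n <- sum(by_quality$n)

# The count-weighted mean of the group means should match the
# overall alcohol mean of about 10.42 from summary().
overall_mean <- weighted.mean(by_quality$alcohol_mean, by_quality$n)

# Share of wines scored 5 or 6.
share_5_6 <- sum(by_quality$n[by_quality$quality %in% c(5, 6)]) / total_n

c(total_n = total_n, overall_mean = overall_mean, share_5_6 = share_5_6)
```

The check also shows how lopsided the groups are: scores 5 and 6 make up roughly 82% of the wines, so the means for scores 3, 4, and 8 rest on very few observations.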
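It does look like the quality score increases with the alcohol content (%), at least from score 5 upward; scores 3 and 4 have a slightly higher mean alcohol than score 5, but they contain few wines. I would like to look at the correlation. I would also like to depict the alcohol by quality score graphically using histograms.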
##
## Pearson's product-moment correlation
##
## data: red_wine$alcohol and red_wine$quality_original
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4373540 0.5132081
## sample estimates:
## cor
## 0.4761663
There is a moderate positive correlation (about 0.48) between alcohol and quality, which is consistent with what I have already looked at.
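To see how the numbers in the test output fit together, the t statistic and the confidence interval can be reconstructed from the correlation and the degrees of freedom alone. This sketch uses the standard formulas for a Pearson correlation test (t from r, and a Fisher z interval); the inputs are copied from the output above.

```r
r  <- 0.4761663   # sample correlation from the output
df <- 1597        # degrees of freedom from the output
n  <- df + 2      # number of observations

# t statistic for testing that the true correlation is zero.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# 95% confidence interval via the Fisher z transformation.
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

# Share of variance in quality explained by alcohol alone.
r_squared <- r^2

round(c(t = t_stat, lower = ci[1], upper = ci[2], r_squared = r_squared), 4)
```

The squared correlation is about 0.227, so alcohol on its own accounts for under a quarter of the variation in the quality score.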
Next I would like to go back to volatile acidity, fixed acidity, and pH to assess how they relate to one another. I grouped these together because they all relate to acidity.
Volatile acidity is “the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste”. The distribution of volatile acidity looks normal with a slight skew. pH describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3 and 4 on the pH scale. The pH of the red wines is within the normal range for wine, with only a few wines above or below that range. Fixed acidity is described as “most acids involved with wine are fixed or nonvolatile (do not evaporate readily)”; it looks evenly distributed. Since volatile acidity and pH would most likely affect the taste, I would like to examine these two together first.
There are a few slightly more basic and slightly more acidic wines, but the dispersion is limited to roughly 2.7 to 4.0, which is mostly in the 3-4 range identified as normal for wine. I don’t believe the wines outside of the normal range are far enough outside it, or numerous enough, to have an effect on the quality score, but I would like to keep this in mind as I analyze further.
As the description states volatile acidity creates the vinegar taste in wine. Let’s see if the volatile acidity has any relationship with pH.
Actually, volatile acidity and pH have the opposite relationship to the one I expected: wines with higher volatile acidity have a higher pH. More research would be needed to understand this relationship. Maybe fixed acidity will have a relationship with pH more like what I expected.
As the fixed acidity increases the pH decreases (meaning that the wine becomes more acidic), as expected. I would now like to look at fixed acidity and volatile acidity to see the relationship between the two.
The relationship between fixed acidity and volatile acidity is interesting. I expected fixed acidity to be positively correlated with volatile acidity, but the graph does not show that. Further research is needed to explain this relationship.
I would like to look at these variables’ relationships with quality.
There is a lot of dispersion in the data, with almost every quality score having at least one outlier and a relatively wide distribution. You can see the mean and median of volatile acidity decreasing to about 0.4 g / dm^3 for the higher scoring wines. I would now like to look at fixed acidity to see how it compares.
The distribution is again very dispersed, but the medians and means are all around 8 g / dm^3. The mean and median look relatively close for each quality score. I would expect a low correlation between fixed acidity and quality, in contrast to volatile acidity, which clearly decreases as the score increases.
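Because pH is a logarithmic scale, small differences in pH mean large differences in acidity. A minimal sketch using the standard definition, pH = -log10 of the hydrogen ion concentration (the helper name is made up for illustration):

```r
# Hydrogen ion concentration (mol / L) implied by a given pH.
h_conc <- function(pH) 10^(-pH)

# One pH unit is a tenfold change in concentration.
ratio_3_vs_4 <- h_conc(3) / h_conc(4)
ratio_3_vs_4

# Ratio between the most acidic (pH 2.74) and least acidic (pH 4.01)
# wines in the dataset.
h_conc(2.74) / h_conc(4.01)
```

So even though the wines only span a little over one pH unit, the most acidic wine has well over ten times the hydrogen ion concentration of the least acidic one.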
The last set I chose is the additives to the wine. Sulphates, total sulfur dioxide, and citric acid help to keep the wine fresh and preserve its flavor.
Sulphates are a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant. It has a skewed distribution. Total sulfur dioxide is “the amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine”. This is also a skewed distribution. Citric acid is “found in small quantities, citric acid can add ‘freshness’ and flavor to wines”. It has a very interesting distribution that seems to be bimodal with a skew.
Next I wanted to normalize the distribution of total sulfur dioxide by taking the log.
The log helps to normalize the distribution but really doesn’t give me any new insight on its own. I would like to use the log-transformed variable to analyze the data with some of the other variables.
There is a slight increase in sulphates as the log-transformed total sulfur dioxide increases. It is very minor and I would not consider this a notable find.
There is no apparent correlation between citric acid and total sulfur dioxide. The points are dispersed almost evenly.
It is interesting that the plot of citric acid and sulphates seems to have a slight upward slope. This finding isn’t really useful, though, and doesn’t raise any additional questions for me.
Now I would like to look at how some of these variables are distributed across quality scores.
Sulphates look to be positively related to quality. The increase in median and mean for each quality score is depicted by the line and star respectively on the graph. Sulphates appear to be an important indicator for wine quality. More research is needed in the future to understand why sulphates have this effect.
I would like to see if box plots can catch any trends between total sulfur dioxide and quality, to compare with sulphates and quality.
This box plot suggests that total sulfur dioxide doesn’t have a strong relationship with quality. The total sulfur dioxide mean and median increase at first but then quickly decrease as the score increases.
Now I would like to explore some variables further by creating some multivariate graphs that will help identify any relationships between several variables.
Let’s look a little bit more closely at the correlation between all of the variables. I decided that the graph with the circles looks best.
The strongest correlations appear to be between total sulfur dioxide and free sulfur dioxide, and between volatile acidity and citric acid.
It’s interesting that none of the variables appear to have a strong relationship with quality. The strongest relationship appears to be between quality and alcohol content (%). Among the other pairs, citric acid and volatile acidity stand out as having a relationship. I would like to look at that relationship a little more closely, especially because citric acid was not one of the variables I analyzed initially.
Citric acid and volatile acidity appear to have a clear trend. One interesting part of the distribution is that when citric acid is 0, volatile acidity ranges from approximately 0.35 to 1.6 g / dm^3. This would be another point that I would like to research further.
I would like to plot the two variables most correlated with quality to see how the distribution looks before I attempt to create a model.
You can see that the lower scores (5 or below) tend to be the high volatile acidity, low alcohol wines, whereas the higher scores sit around the top left side of the graph.
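For reference, a correlation matrix like the one behind the circle graph can be computed with cor(). The sketch below only uses the first ten rows of five variables, copied from the str() output, so its values will not match the full-data correlations; it is meant to show the mechanics, not to reproduce the graph.

```r
# First ten rows of five variables, copied from the str() output.
first_rows <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

# Pairwise Pearson correlations between all columns.
cor_matrix <- cor(first_rows, method = "pearson")
round(cor_matrix, 2)

# Locate the strongest off-diagonal correlation (it appears twice
# because the matrix is symmetric).
off_diag <- cor_matrix
diag(off_diag) <- 0
which(abs(off_diag) == max(abs(off_diag)), arr.ind = TRUE)
```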
##
## Calls:
## m1: lm(formula = I(quality_original) ~ I(alcohol), data = red_wine)
## m2: lm(formula = I(quality_original) ~ I(alcohol) + sulphates, data = red_wine)
## m3: lm(formula = I(quality_original) ~ I(alcohol) + sulphates + volatile.acidity,
## data = red_wine)
##
## ==============================================================
## m1 m2 m3
## --------------------------------------------------------------
## (Intercept) 1.875*** 1.375*** 2.611***
## (0.175) (0.177) (0.196)
## I(alcohol) 0.361*** 0.346*** 0.309***
## (0.017) (0.016) (0.016)
## sulphates 0.994*** 0.679***
## (0.102) (0.101)
## volatile.acidity -1.221***
## (0.097)
## --------------------------------------------------------------
## R-squared 0.227 0.270 0.336
## adj. R-squared 0.226 0.269 0.335
## sigma 0.710 0.690 0.659
## F 468.267 294.988 268.912
## p 0.000 0.000 0.000
## Log-likelihood -1721.057 -1675.142 -1599.384
## Deviance 805.870 760.894 692.105
## AIC 3448.114 3358.284 3208.768
## BIC 3464.245 3379.793 3235.654
## N 1599 1599 1599
## ==============================================================
The fit of the model is not very good: even the three-variable model has an R-squared of only 0.336. In the exploratory analysis I found that alcohol, sulphates, and volatile acidity showed the strongest relationships with quality. My research shows that there are a variety of factors affecting the quality of wine that were not included in this dataset, so that is one possible explanation. Now I would like to add more variables to see if that improves the fit.
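To make the m3 coefficients concrete, here is a small worked example that plugs values into the fitted equation by hand. The coefficients are the rounded ones from the table above and the input values are the variable means from summary(); the helper function is only for illustration.

```r
# Rounded coefficients of model m3 from the table above.
coef_m3 <- c(
  intercept        =  2.611,
  alcohol          =  0.309,
  sulphates        =  0.679,
  volatile.acidity = -1.221
)

# Predicted quality score for given predictor values.
predict_m3 <- function(alcohol, sulphates, volatile.acidity) {
  coef_m3[["intercept"]] +
    coef_m3[["alcohol"]] * alcohol +
    coef_m3[["sulphates"]] * sulphates +
    coef_m3[["volatile.acidity"]] * volatile.acidity
}

# At the variable means, the prediction should land close to the
# mean quality score of 5.636 (a property of least squares fits).
at_means <- predict_m3(alcohol = 10.42, sulphates = 0.6581,
                       volatile.acidity = 0.5278)
at_means

# Same wine but with 12% alcohol instead of 10.42%.
predict_m3(alcohol = 12, sulphates = 0.6581, volatile.acidity = 0.5278)
```

With a residual standard error (sigma) of about 0.66, predictions like these are only accurate to within roughly a quality point or so, which fits with the modest R-squared.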
##
## Calls:
## m1: lm(formula = I(quality_original) ~ I(alcohol), data = red_wine)
## m2: lm(formula = I(quality_original) ~ I(alcohol) + volatile.acidity,
## data = red_wine)
## m3: lm(formula = I(quality_original) ~ I(alcohol) + volatile.acidity +
## sulphates, data = red_wine)
## m4: lm(formula = I(quality_original) ~ I(alcohol) + volatile.acidity +
## sulphates + citric.acid, data = red_wine)
## m5: lm(formula = I(quality_original) ~ I(alcohol) + volatile.acidity +
## sulphates + citric.acid + total.sulfur.dioxide, data = red_wine)
##
## ==============================================================================================
## m1 m2 m3 m4 m5
## ----------------------------------------------------------------------------------------------
## (Intercept) 1.875*** 3.095*** 2.611*** 2.646*** 2.843***
## (0.175) (0.184) (0.196) (0.201) (0.205)
## I(alcohol) 0.361*** 0.314*** 0.309*** 0.309*** 0.295***
## (0.017) (0.016) (0.016) (0.016) (0.016)
## volatile.acidity -1.384*** -1.221*** -1.265*** -1.222***
## (0.095) (0.097) (0.113) (0.112)
## sulphates 0.679*** 0.696*** 0.721***
## (0.101) (0.103) (0.103)
## citric.acid -0.079 -0.043
## (0.104) (0.104)
## total.sulfur.dioxide -0.002***
## (0.001)
## ----------------------------------------------------------------------------------------------
## R-squared 0.227 0.317 0.336 0.336 0.344
## adj. R-squared 0.226 0.316 0.335 0.334 0.342
## sigma 0.710 0.668 0.659 0.659 0.655
## F 468.267 370.379 268.912 201.777 166.962
## p 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -1721.057 -1621.814 -1599.384 -1599.093 -1589.749
## Deviance 805.870 711.796 692.105 691.852 683.814
## AIC 3448.114 3251.628 3208.768 3210.186 3193.499
## BIC 3464.245 3273.136 3235.654 3242.448 3231.138
## N 1599 1599 1599 1599 1599
## ==============================================================================================
It doesn’t look like adding more variables did much to improve the fit of the model: R-squared only rises from 0.336 to 0.344, and citric acid is not a significant predictor.
The dataset contains red wine from a region in Portugal. It has 1599 observations and 13 variables. I looked more closely at about 9 of the variables within the dataset. The most important variables for wine quality seem to be alcohol content (%), volatile acidity, and sulphates. Some final graphs depicting these top variables:
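The fit statistics in the table are related by simple formulas, which makes it easier to see why the extra variables add so little. This sketch recomputes adjusted R-squared and AIC from the R-squared and log-likelihood rows of the table above; small differences come from the rounded inputs.

```r
n <- 1599  # observations used in every model

# Values copied from the comparison table above.
models <- data.frame(
  model  = c("m1", "m2", "m3", "m4", "m5"),
  p      = 1:5,  # number of predictors
  r2     = c(0.227, 0.317, 0.336, 0.336, 0.344),
  loglik = c(-1721.057, -1621.814, -1599.384, -1599.093, -1589.749)
)

# Adjusted R-squared penalizes each added predictor.
models$adj_r2 <- 1 - (1 - models$r2) * (n - 1) / (n - models$p - 1)

# AIC = -2 * log-likelihood + 2 * (number of estimated parameters);
# for a linear model that is p slopes + intercept + error variance.
models$aic <- -2 * models$loglik + 2 * (models$p + 2)

models
```

Adding citric acid (m4) leaves R-squared unchanged and makes AIC slightly worse than m3, so that variable is not earning its place in the model.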
[Plot 1: volatile acidity by quality score; y-axis label “Volatile Acidity”, centered bold italic title]
Plot 1 looks at the relationship between volatile acidity and quality, layering the mean and median on a box plot. After looking at the correlations between all of the variables, this was one of the stronger relationships, and I felt the graph shows the relationship well. For each quality score there are points which depict the distribution of volatile acidity. Additionally there are box plots which have a line for the median and a red star for the mean volatile acidity. The box plot helps clarify the distribution and makes the midpoint of the data clear.
Plot 2 is a histogram of alcohol content counts for each quality score. I chose this graph because it gives a good idea of the distribution of both the quality scores and the alcohol content. It provides multiple insights into the total distribution of quality scores and alcohol content combined. The majority of the wines fall within the 5 and 6 quality scores. Further, score 5 has a skewed distribution of alcohol content, with the majority of wines under 10%, whereas score 6 has a more even distribution. One thing this graph really shows is that there are few low quality and few high quality wines; most wines fall in the middle to above average quality. This could indicate some issues with the dataset.
Plot 3 depicts the relationship between sulphates, alcohol, and quality. It is a scatter plot of sulphates vs. alcohol content with a layer of color for quality score and also trend lines. I only used data up to the 99th percentile to remove some of the outliers. I found it helpful to add a theme and trend lines to make the graph more aesthetically pleasing than the original. What this graph best depicts is the difference between the higher sulphate, higher alcohol wines and the low sulphate, low alcohol wines. I feel this is important because it does a good job of showing that there is a relationship between these elements, even if the relationships are not that strong.
While the chemical relationship between the quality of wine and variables like alcohol content (%) and sulphates is important, it is not the only part of wine production that has an effect on quality. As I review my analysis I find that the chemical properties were not a great fit for the quality score. Even if I utilized all of the variables in the dataset, the model would not account for much of the variation in quality. Some questions that come to mind about the dataset:
Could it be the dataset? I am not a wine expert, but there were not many wines with a high quality score in the dataset. Most of the wines seem to be of middle quality. Maybe a larger dataset with a wider variety of wine quality would give a different, more comprehensive result.
Could there be external factors not included in the dataset that affect the quality of the wine? Using wine from the same region, as was done in the dataset, would limit this effect, but it is still possible. Without further investigation and research it is difficult to know. This would definitely need to be investigated further.
These questions and more would have to be answered with further statistical analysis and research to comprehensively understand the relationship between the quality of wine and its chemical properties.
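Trimming at the 99th percentile, as described for Plot 3, can be sketched with quantile() and subset(). The data here is simulated purely for illustration (the ranges loosely follow the summary statistics), and which variable or variables were actually trimmed is an assumption on my part.

```r
set.seed(1)

# Simulated stand-in data, NOT the wine dataset: a right-skewed
# sulphates column and a uniform alcohol column.
toy <- data.frame(
  sulphates = rlnorm(1000, meanlog = log(0.62), sdlog = 0.25),
  alcohol   = runif(1000, min = 8.4, max = 14.9)
)

# Keep only rows at or below the 99th percentile of sulphates.
cutoff  <- quantile(toy$sulphates, 0.99)
trimmed <- subset(toy, sulphates <= cutoff)

c(before = nrow(toy), after = nrow(trimmed))
```

Trimming like this changes only what is drawn; any statistics reported alongside the plot should say whether they were computed on the trimmed or the full data.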